Introduction:
The dataset used in this EDA consists of white wine samples of the Portuguese “Vinho Verde” wine. For more details, consult http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].
Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (excellent).
The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines.
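Attribute Information:
Input variables (based on physicochemical tests): fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol
Output variable (based on sensory data): quality (score between 0 and 10)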
This section summarizes the dataset and all of its variables, shows histograms for the important variables, and creates new variables where necessary.
White Wine Dataset Summary
Null values in Dataset
## [1] 0
row count
## [1] 4898
column count
## [1] 13
Dataset Summary
##        X        fixed.acidity    volatile.acidity   citric.acid
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800     Min.   :0.0000
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100     1st Qu.:0.2700
##  Median :2450   Median : 6.800   Median :0.2600     Median :0.3200
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782     Mean   :0.3342
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200     3rd Qu.:0.3900
##  Max.   :4898   Max.   :14.200   Max.   :1.1000     Max.   :1.6600
##  residual.sugar     chlorides         free.sulfur.dioxide
##  Min.   : 0.600     Min.   :0.00900   Min.   :  2.00
##  1st Qu.: 1.700     1st Qu.:0.03600   1st Qu.: 23.00
##  Median : 5.200     Median :0.04300   Median : 34.00
##  Mean   : 6.391     Mean   :0.04577   Mean   : 35.31
##  3rd Qu.: 9.900     3rd Qu.:0.05000   3rd Qu.: 46.00
##  Max.   :65.800     Max.   :0.34600   Max.   :289.00
##  total.sulfur.dioxide   density          pH              sulphates
##  Min.   :  9.0          Min.   :0.9871   Min.   :2.720   Min.   :0.2200
##  1st Qu.:108.0          1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100
##  Median :134.0          Median :0.9937   Median :3.180   Median :0.4700
##  Mean   :138.4          Mean   :0.9940   Mean   :3.188   Mean   :0.4898
##  3rd Qu.:167.0          3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500
##  Max.   :440.0          Max.   :1.0390   Max.   :3.820   Max.   :1.0800
##     alcohol         quality
##  Min.   : 8.00   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.40   Median :6.000
##  Mean   :10.51   Mean   :5.878
##  3rd Qu.:11.40   3rd Qu.:6.000
##  Max.   :14.20   Max.   :9.000
Dataset Observations:
idealpH categorical variable: Based on the information that 3-3.4 is the best pH level for white wines, a categorical variable idealpH is created. It takes the value ‘Yes’ when the pH level is between 3 and 3.4, and ‘No’ otherwise.
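The rule above can be sketched in a few lines. The report shows only R-style output and no code, so the Python below is an illustration rather than the author's implementation. The function name `ideal_ph_flag` is hypothetical, and treating both endpoints (3.0 and 3.4) as inside the range is an assumption, since the text only says "3-3.4".

```python
def ideal_ph_flag(ph, low=3.0, high=3.4):
    """Return 'Yes' when ph lies in [low, high], otherwise 'No'.

    Including both endpoints is an assumption; the report does not say
    whether the bounds are inclusive.
    """
    return "Yes" if low <= ph <= high else "No"


# Minimum, median and maximum pH taken from the dataset summary.
for ph in (2.72, 3.18, 3.82):
    print(ph, ideal_ph_flag(ph))
```

With these bounds, only the median value falls inside the ideal range; the minimum and maximum pH values in the dataset both fall outside it.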
Fixed Acidity Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Fixed Acidity plot
Volatile Acidity Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Volatile Acidity plot
Citric Acid Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Citric Acid plot
Residual Sugar Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Residual Sugar plot
Total sulfur dioxide Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Total Sulfur dioxide plot
Density Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Density plot
pH Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pH plot
idealpH category variable (pH value 3-3.4) Summary
## No Yes
## 834 4064
Bar plot for idealpH variable
Sulphates Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Sulphates plot
Alcohol (% by volume) Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Alcohol (% by volume) plot
Quality Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
Quality plot
Number of Instances in white wine Dataset: 4898.
Number of Attributes: 13 columns in total. Column “X” identifies the sample and the remaining 12 columns hold the sample attributes.
Missing Attribute Values: None
The dataset is tidy and there are no missing values.
residual.sugar, alcohol, pH and fixed.acidity are the main attributes.
quality, sulphates and density can help in understanding more about the wine.
I have created the idealpH category variable based on the ideal pH range 3-3.4.
- Most of the individual variables are approximately normally distributed.
- The residual sugar distribution is skewed.
- fixed.acidity, volatile.acidity, citric.acid, total.sulfur.dioxide, density and residual.sugar have some outliers.
- More than 80% of samples are in the ideal pH range.
- Not all quality levels are present: the scores in the dataset range from 3 to 9.
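One way to sanity-check the outlier observation is to compare each variable's minimum and maximum against Tukey's 1.5 × IQR fences, using only the quartiles printed in the summaries above. This is a sketch in Python, not the method used in the report (which appears to rely on plots). The 1.5 multiplier and the helper name `tukey_fences` are assumptions.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# (min, 1st quartile, 3rd quartile, max) copied from the summaries above.
summaries = {
    "fixed.acidity": (3.8, 6.3, 7.3, 14.2),
    "volatile.acidity": (0.08, 0.21, 0.32, 1.1),
    "citric.acid": (0.0, 0.27, 0.39, 1.66),
    "residual.sugar": (0.6, 1.7, 9.9, 65.8),
    "total.sulfur.dioxide": (9.0, 108.0, 167.0, 440.0),
    "density": (0.9871, 0.9917, 0.9961, 1.039),
}

for name, (lowest, q1, q3, highest) in summaries.items():
    lower, upper = tukey_fences(q1, q3)
    # A min below the lower fence or a max above the upper fence means
    # at least one point would be flagged as an outlier by this rule.
    print(f"{name}: fences=({lower:.4f}, {upper:.4f}) "
          f"min_outside={lowest < lower} max_outside={highest > upper}")
```

For all six variables the maximum lies above the upper fence, which is consistent with the outliers noted in the list.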
Based on the individual variable analysis above, this section uses bivariate analysis to show comparisons and trends between two variables. Scatterplots are a good way to analyze bivariate relationships, so they are used here. Plots are analysed for the pairs below.
- Negative association between fixed.acidity and pH value
- Positive association between sulphates and pH value
- The trend between fixed.acidity and sulphates increases and decreases multiple times
- Most of the samples have residual sugar below 20 grams, with some outliers
- Alcohol and quality seem to have a positive correlation
- The majority of samples have an alcohol level above 10% and are in the ideal pH range
- Total sulfur dioxide is used as an indicator of wine freshness, and the majority of samples have total sulfur dioxide above 100, which may suggest that most wine samples are not aged well
- The strongest relationships are found between fixed.acidity and pH, and between sulphates and pH
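The direction and strength of associations like those listed above can be quantified with Pearson's correlation coefficient. The sketch below is a plain-Python illustration, not the report's own computation. The function name `pearson_r` is hypothetical, and the input values are made-up placeholders, not rows from the wine dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant column gives zero variance and a division by zero.
    return cov / math.sqrt(var_x * var_y)


# Made-up placeholder values, NOT taken from the wine dataset.
fixed_acidity = [6.0, 6.5, 7.0, 7.5, 8.0]
ph = [3.40, 3.30, 3.25, 3.10, 3.00]
print(round(pearson_r(fixed_acidity, ph), 3))
```

A value near -1 or +1 indicates a strong linear association, while a value near 0 indicates a weak one. Applied to the real columns, this would put a number on the "strongest relationship" claim.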
This section explores associations between multiple variables. Based on the analysis of the bivariate plots, the variables below are analyzed together.
Relationship between Fixed Acidity, Sulphates and pH value
Relationship between Alcohol, Residual sugar and Quality
- The majority of samples have fixed.acidity above 6.
- The distribution of pH values with respect to fixed.acidity and sulphates is approximately normal, with some outliers. This is consistent with the bivariate analysis of pH vs sulphates and pH vs fixed.acidity.
- Lower pH values are present when fixed.acidity is higher, and higher pH values are present when fixed.acidity is lower.
- Higher quality samples are seen when alcohol is between 8.5% and 10%.
Below are the 3 plots with the most interesting findings.
The ideal pH range of white wine is between 3 and 3.4. From the above plot we can see that more than 80% of samples are in the ideal pH range.
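The "more than 80%" figure can be checked directly from the idealpH counts reported earlier (834 "No", 4064 "Yes"). A minimal Python sketch, using only those two numbers:

```python
# Counts copied from the idealpH summary table in this report.
counts = {"No": 834, "Yes": 4064}

total = sum(counts.values())
share_yes = counts["Yes"] / total
print(f"{total} samples, {share_yes:.1%} in the ideal pH range")
```

The total matches the 4898 rows of the dataset, and the share in the ideal range comes out at roughly 83%.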
From plots 1 and 2 we can confirm that the majority of samples in the ideal pH range have quality levels between 5 and 7. Quality has some positive association with the ideal pH value.
From the above plot we can confirm that pH is lower when fixed.acidity is higher, and vice versa. The same relation can be seen between sulphates and pH, which strengthens the findings of the individual bivariate analyses involving pH.
This is the tidiest dataset and it was easy to perform univariate analysis. I found trends between fixed acidity and pH, and between sulphates and pH. An interesting finding is that the majority of samples have total.sulfur.dioxide above 100. For the bivariate analysis I could not figure out the main attributes and supporting attributes initially, which resulted in some rework. After some research on white wine I was able to determine them, which suggests that prior knowledge of the subject is required to analyze a dataset properly. Some of the main attributes of wine, such as age, tannins and types of grapes, are not included in the dataset; they would have helped in understanding more about quality.